This is the example script (partly modified) described in “Explainable artificial intelligence enhances the ecological interpretability of black-box species distribution models” by Ryo et al. The original code can be found here.
The species occurrence and climate data are taken from https://www.gbif.org/ using the {sdmbench} R package.
library(sf)
library(mapview)
library(sp)
library(maptools)
library(sdmbench)
library(mlr)
library(iml)
library(lime)
library(dplyr)
library(ggplot2)
library(rsample)
library(withr)
library(RStoolbox)
library(DT)
dictionary = read.csv("worldclim_dictionary.csv")
A rare species that lives on calcareous rocks. Mesoclimate, microclimate, and topography play a crucial role in driving its presence.
species = "Campanula morettiana"
climate_resolution = 2.5
# Get and prepare data ----------------------------------------------------
set.seed(42)
# This function downloads and stores the data. To keep the rest of the script reproducible (GBIF data can be updated with new observations), we load a stored static dataset.
occ_data_raw <-
  get_benchmarking_data(species, limit = 1000, climate_resolution = climate_resolution)
## [1] "Getting benchmarking data...."
## [1] "Cleaning benchmarking data...."
## OGR data source with driver: ESRI Shapefile
## Source: "/tmp/RtmpEdZyLZ", layer: "ne_50m_land"
## with 1420 features
## It has 3 fields
## Integer64 fields read as strings: scalerank
## [1] "Done!"
colnames(occ_data_raw$df_data) = c(dictionary$Abbreviation,"label")
names(occ_data_raw$raster_data$climate_variables) <- dictionary$Abbreviation
occ_data <- occ_data_raw$df_data
occ_data$label <- as.factor(occ_data$label)
coordinates_df <- rbind(occ_data_raw$raster_data$coords_presence, occ_data_raw$raster_data$background)
occ_data <- normalizeFeatures(occ_data, method = "standardize")
occ_data <- cbind(occ_data, coordinates_df)
occ_data <- na.omit(occ_data)
occ_sf <- st_as_sf(occ_data,coords = c("x", "y"), crs = 4326, agr = "constant")
occ_sf = occ_sf %>% filter(label == 1)
mapview()+mapview(occ_sf,zcol = "label")
set.seed(42)
train_test_split <- initial_split(occ_data, prop = 0.7)
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
data_train$x <- NULL
data_train$y <- NULL
data_test_subset <- data_test %>% filter(label == 1)
Show performance and variable importance
task <-
makeClassifTask(id = "model", data = data_train, target = "label")
lrn <- makeLearner("classif.randomForest", predict.type = "prob")
mod <- train(lrn, task)
pred <- predict(mod, newdata=data_test)
VIMP <- as.data.frame(getFeatureImportance(mod)$res)
# Top n important variables
top_n(VIMP, n = 5, importance) %>%
  ggplot(aes(x = reorder(variable, importance), y = importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  xlab("")
# Performance
performance(pred, measures=auc)
## auc
## 0.9792238
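Beyond AUC, other standard mlr measures can be evaluated on the same prediction object. A minimal sketch, assuming the `pred` object from above (`acc` and `brier` are built-in mlr classification measures):

```r
# Evaluate several measures at once on the held-out test predictions:
# AUC, plain accuracy, and the Brier score (lower is better).
performance(pred, measures = list(auc, acc, brier))
```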
Accumulated local effects (ALE) and partial dependence plots (PDPs) both show the average model prediction over a feature. The difference is that ALEs are computed as accumulated differences over the conditional distribution, whereas PDPs average over the marginal distribution. ALE plots are preferable to PDPs because they are faster to compute and remain unbiased when features are correlated.
ALE plots for categorical features are automatically ordered by the similarity of the categories, based on the distribution of the other features for instances in each category. When the feature is an ordered factor, the ALE plot keeps the order as is.
predictor <-
Predictor$new(mod, data = data_train, class = 1, y = "label")
ale <- FeatureEffect$new(predictor, feature = "Temp_seasonality")
ale$plot() +
theme_minimal() +
ggtitle("ALE Feature Effect") +
xlab("Temperature seasonality")
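For comparison, the same `iml::Predictor` object can produce a partial dependence curve by passing `method = "pdp"` to `FeatureEffect$new()` (the default is `method = "ale"`). A sketch, assuming the `predictor` defined above:

```r
# PDP for the same feature; compare its shape against the ALE curve to
# gauge how much correlated features bias the marginal average.
pdp <- FeatureEffect$new(predictor, feature = "Temp_seasonality", method = "pdp")
pdp$plot() +
  theme_minimal() +
  ggtitle("PDP Feature Effect") +
  xlab("Temperature seasonality")
```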
predictor <-
Predictor$new(mod, data = data_train, class = 1, y = "label")
ale <- FeatureEffect$new(predictor, feature = "Annual_precip")
ale$plot() +
theme_minimal() +
ggtitle("ALE Feature Effect") +
xlab("Annual precipitation")
Once an explainer has been created with the lime() function, it can be used to explain the model's predictions on new observations. The explain() function takes the new observations along with the explainer and returns a data.frame with prediction explanations, one observation per row. The returned explanations can then be visualised in a number of ways, e.g. with plot_features().
We select three predictions at random.
# sample three presence cases from the test set for local explanation
sample_data <- withr::with_seed(10, sample_n(data_test_subset, 3))
sample_data_coords <- dplyr::select(sample_data, c("x", "y"))
sample_data$x <- NULL
sample_data$y <- NULL
set.seed(42)
explainer <- lime(data_train, mod)
set.seed(42)
explanation <-
lime::explain(sample_data,
explainer,
n_labels = 1,
n_features = 5)
plot_features(explanation, ncol = 1) + xlab(NULL)
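lime also provides `plot_explanations()`, which arranges the same local explanations as a compact heatmap, which is handy when comparing many cases at once. A sketch using the `explanation` object from above:

```r
# Heatmap-style overview: features on the y-axis, cases on the x-axis,
# fill encoding each feature's weight in the local explanation.
plot_explanations(explanation)
```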
customPredictFun <- function(model, data) {
v <- predict(model, data, type = "prob")
v <- as.data.frame(v)
colnames(v) <- c("absence", "presence")
return(v$presence)
}
normalized_raster <- RStoolbox::normImage(occ_data_raw$raster_data$climate_variables)
pr <-
dismo::predict(normalized_raster,
mlr::getLearnerModel(mod, TRUE),
fun = customPredictFun)
coordinates(sample_data_coords) <- ~ x + y
sl1 <- list("sp.points", sample_data_coords, pch=1, cex=1.2, lwd=2, col="white")
sl2 <- list("sp.pointLabel", sample_data_coords,
label=c("Case 1","Case 2","Case 3"),
cex=0.9, col="white", fontface=2)
rf_map <-
spplot(pr, main = "Habitat Suitability Map",
scales = list(draw = TRUE),
sp.layout = list(sl1,sl2),
labels=TRUE
)
rf_map
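If the suitability surface is needed outside R (e.g. in a GIS), the predicted raster can be written to disk. A sketch, assuming the `pr` raster from above (the file name is arbitrary):

```r
# Persist the habitat suitability raster as a GeoTIFF for external GIS use
raster::writeRaster(pr, "habitat_suitability.tif", overwrite = TRUE)
```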
An example of an alpine grasshopper
For the next species we will only change the following lines of code:
species = "Miramella alpina"
climate_resolution = 10
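Since only `species` and `climate_resolution` change between the examples, the whole pipeline could be wrapped in a small helper. A sketch; `run_sdm_workflow` is a hypothetical name and its body only outlines the steps shown above:

```r
# Hypothetical wrapper around the workflow above; only the two
# parameters vary between the species examples.
run_sdm_workflow <- function(species, climate_resolution) {
  occ_data_raw <- get_benchmarking_data(species, limit = 1000,
                                        climate_resolution = climate_resolution)
  # ... data cleaning, normalization, train/test split ...
  # ... model training, variable importance, ALE, LIME, and mapping as above ...
}
```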
Show performance and variable importance
## auc
## 0.9515574
A Mediterranean plant species
For the next species we will only change the following lines of code:
species = "Asparagus acutifolius"
climate_resolution = 10
Show performance and variable importance
## auc
## 0.9284497
DT::datatable(dictionary,rownames = FALSE)